Harmonisation of variables names prior to conducting statistical analyses with multiple datasets: an automated approach
Abstract
BACKGROUND Data requirements by governments, donors and the international community to measure health and development achievements have increased in the last decade. Datasets produced in surveys conducted in several countries and years are often combined to analyse time trends and geographical patterns of demographic and health-related indicators. However, since not all datasets have the same structure, variable definitions and codes, they have to be harmonised before being submitted to statistical analysis. Manually searching, renaming and recoding variables is extremely tedious and error-prone, especially when the number of datasets and variables is large. This article presents an automated approach to harmonising variable names across several datasets, which optimises the search for variables, minimises manual input and reduces the risk of error.

RESULTS Three consecutive algorithms are applied iteratively to search for each variable of interest in all datasets. The first search (A) captures particular cases that could not be resolved automatically in the search iterations; the second search (B) is run if search A produces no hits and identifies variables whose labels contain certain key terms defined by the user. If this search also produces no hits, a third one (C) is run to retrieve variables that have been identified in other surveys, as an illustration. For each variable of interest, the output of these search engines can be (O1) a single best-matching variable is found, (O2) more than one matching variable is found or (O3) no matching variable is found. Output O2 is resolved by user judgement. Examples using four variables are presented, showing that the searches have 100% sensitivity and specificity after a second iteration.

CONCLUSION Efficient and tested automated algorithms should be used to support the harmonisation process needed to analyse multiple datasets. This is especially relevant when the number of datasets or variables to be included is large.
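The cascade of searches A, B and C and the three outputs O1 to O3 can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the representation of a dataset as a dictionary of variable names and labels, and the rule that a label must contain all key terms (case-insensitively) are assumptions made for the example.

```python
from typing import Dict, List, Optional


def search_variable(
    labels: Dict[str, str],
    key_terms: List[str],
    special_cases: Optional[List[str]] = None,
    known_names: Optional[List[str]] = None,
) -> List[str]:
    """Return candidate variable names for one variable of interest.

    labels        -- hypothetical mapping of variable name -> variable label
    key_terms     -- user-defined terms the label must contain (search B)
    special_cases -- names registered by hand for this dataset (search A)
    known_names   -- names already identified in other surveys (search C)
    """
    # Search A: particular cases that were resolved manually.
    hits = [name for name in (special_cases or []) if name in labels]

    # Search B: run only if A found nothing; match key terms in the label.
    # Assumption: every key term must appear, ignoring case.
    if not hits:
        hits = [
            name
            for name, label in labels.items()
            if all(term.lower() in label.lower() for term in key_terms)
        ]

    # Search C: run only if B found nothing; reuse names seen elsewhere.
    if not hits:
        hits = [name for name in (known_names or []) if name in labels]

    return hits


def classify_output(hits: List[str]) -> str:
    """Map the number of hits to the outputs O1, O2 and O3."""
    if len(hits) == 1:
        return "O1"  # single best-matching variable
    if len(hits) > 1:
        return "O2"  # several candidates, left to user judgement
    return "O3"      # no matching variable


# Hypothetical dataset used only to exercise the sketch.
example_labels = {
    "v012": "Respondent's current age",
    "v025": "Type of place of residence",
    "hv025": "Type of place of residence (household)",
}
age_hits = search_variable(example_labels, ["age"])
residence_hits = search_variable(example_labels, ["place", "residence"])
```

In this sketch an O2 result simply returns all candidates; registering the chosen name under `special_cases` for the next iteration is one way the user's decision could be fed back into search A.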